Unifying Vision-Language Representation Space with Single-Tower Transformer


Abstract

Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and caption can be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from the previous works that have modality-specific representation spaces such as zero-shot localization, text-guided visual reasoning and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.
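The idea of treating an image and its caption as two views to be pulled together can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. The sketch below is not OneR's actual training objective, which the abstract does not specify. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    # Mean cross-entropy of matching anchors[i] to positives[i]
    # against every other entry of `positives` in the batch.
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.1):
    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


# Toy embeddings (hypothetical): two images and two candidate caption sets.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # caption i matches image i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong image

loss_aligned = symmetric_contrastive_loss(image_emb, aligned_text)
loss_swapped = symmetric_contrastive_loss(image_emb, swapped_text)
print(loss_aligned < loss_swapped)
```

In a single-tower setting, both `image_emb` and `text_emb` would come from the same encoder, so the loss compares points in one shared space rather than aligning two modality-specific spaces.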


Related resources

Feature Space Trajectory Representation for Active Vision

A new feature space trajectory (FST) description of 3-D distorted views of an object is advanced for active vision applications. In an FST, different distorted object views are vertices in feature space. A new eigen-feature space and Fourier transform features are used. Vertices for different adjacent distorted views are connected by straight lines so that an FST is created as the viewpoint chang...


Unifying Low-Level Vision

This white paper supports the goal of establishing Computer Vision as a coherent intellectual discipline by suggesting a specific agenda for the unification of many low-level vision principles, algorithms, and data structures. Our goal is to identify a set of highly related low-level vision problems, define their common structure, and establish a coherent intellectual discipline around the shar...


Unifying Class-Based Representation


Constant Space Complexity Environment Representation for Vision-based Navigation

This paper presents a preliminary conceptual investigation into an environment representation that has constant space complexity with respect to the camera image space. This type of representation allows the planning algorithms of a mobile agent to bypass what are often complex and noisy transformations between camera image space and Euclidean space. The approach is to compute per-pixel potenti...


Unifying Class-Based Representation Formalisms

The notion of class is ubiquitous in computer science and is central in many formalisms for the representation of structured knowledge used both in knowledge representation and in databases. In this paper we study the basic issues underlying such representation formalisms and single out both their common characteristics and their distinguishing features. Such investigation leads us to propose a...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i1.25178